fix: prevent S3 path conflicts using tempfile #569
Does this fix the file not found errors we sometimes see as an S3 source?
expected_filenames.sort()
actual_filenames.sort()
assert expected_filenames == actual_filenames, (
It's not super important here and shouldn't be a blocker, but in general I would avoid this pattern in the code.
I did some math and you should get about a 10x-12x speedup if you build and compare two sets instead, because sorting (Timsort) is O(n log n). The comparison itself is O(n) for both sets and lists, so the sort is the extra cost.
For 100k files this would be a difference of roughly 3.5 million operations (two sorted lists) vs. 300k operations (sets):
expected_filenames = {Path(s3_key).name for s3_key in s3_keys}
actual_filenames = {Path(download_file).name for download_file in download_files}
assert expected_filenames == actual_filenames
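One thing to keep in mind with the set version: sets drop duplicates, so two S3 keys that share a basename collapse into one entry. A minimal sketch with made-up keys and paths (not taken from the PR's test) shows how the two checks can disagree, and how `collections.Counter` keeps the comparison O(n) while still counting duplicates:

```python
from collections import Counter
from pathlib import Path

# Hypothetical inputs for illustration: two keys share the basename "report.pdf",
# but only one of them was actually downloaded.
s3_keys = ["a/report.pdf", "b/report.pdf", "c/notes.txt"]
download_files = ["/tmp/x/report.pdf", "/tmp/y/notes.txt"]

expected_names = [Path(key).name for key in s3_keys]
actual_names = [Path(path).name for path in download_files]

# Sorted lists keep duplicates, so the missing file is detected.
lists_match = sorted(expected_names) == sorted(actual_names)

# Sets discard duplicates, so the missing file goes unnoticed.
sets_match = set(expected_names) == set(actual_names)

# Counter is a multiset: linear time on average, and duplicates still count.
counters_match = Counter(expected_names) == Counter(actual_names)
```

If basenames are guaranteed to be unique in the test data, the plain set comparison is enough.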
if not file_data.source_identifiers:
    return None

filename = file_data.source_identifiers.filename
if not filename:
    return None
Define both checks as boolean variables, join them with "and", and return None once.
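A rough sketch of what that suggestion could look like. The `FileData` and `SourceIdentifiers` classes and the `get_filename` function below are hypothetical stand-ins written for this example, not the project's real definitions:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SourceIdentifiers:  # hypothetical stand-in for the real class
    filename: Optional[str] = None


@dataclass
class FileData:  # hypothetical stand-in for the real class
    source_identifiers: Optional[SourceIdentifiers] = None


def get_filename(file_data: FileData) -> Optional[str]:
    """Return the source filename, or None if it is unavailable."""
    has_identifiers = bool(file_data.source_identifiers)
    # Short-circuits, so .filename is only read when identifiers exist.
    has_filename = has_identifiers and bool(file_data.source_identifiers.filename)

    # Single exit point for both "missing" cases.
    if not (has_identifiers and has_filename):
        return None
    return file_data.source_identifiers.filename
```

The second variable has to short-circuit on the first, otherwise reading `.filename` on a missing `source_identifiers` would raise `AttributeError`.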
mkdir_concurrent_safe(self.download_dir)

temp_dir = tempfile.mkdtemp(
    prefix="unstructured_",
I'd make this a class-level constant
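For example, something along these lines. The class name, the constant name, and the `dir=` argument are assumptions made for this sketch (the excerpt above is truncated after `prefix=`), and `Path.mkdir` stands in for `mkdir_concurrent_safe`:

```python
import tempfile
from pathlib import Path


class S3DownloaderSketch:  # hypothetical class, for illustration only
    # Prefix hoisted out of the method body so it is defined in one place.
    TEMP_DIR_PREFIX = "unstructured_"

    def __init__(self, download_dir: str) -> None:
        self.download_dir = Path(download_dir)

    def make_temp_dir(self) -> Path:
        # Stand-in for mkdir_concurrent_safe(self.download_dir).
        self.download_dir.mkdir(parents=True, exist_ok=True)
        # Assumption: the temp dir is created under download_dir.
        temp_dir = tempfile.mkdtemp(prefix=self.TEMP_DIR_PREFIX, dir=self.download_dir)
        return Path(temp_dir)
```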
A few nits but otherwise LGTM!
Problem
S3 downloads were sometimes failing with NotADirectoryError and FileExistsError when S3 buckets contained objects with conflicting naming patterns that cannot be represented in traditional filesystem hierarchies.
Example conflict:
foo (file)
foo/documents (file requiring foo to be a directory)
This created a race condition where download order determined success/failure.
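The conflict can be reproduced outside S3 with a small sketch. The `naive_download` and `try_order` helpers below are hypothetical and simply mirror each key as a local path. They are not the project's downloader, and the exact exception type raised depends on the order and the platform:

```python
import tempfile
from pathlib import Path


def naive_download(root: Path, key: str, body: bytes) -> None:
    """Write an object to root/key, mirroring the S3 key as a local path."""
    target = root / key
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(body)


def try_order(keys):
    """Download keys in order; return (failed_key, error_name) or None."""
    with tempfile.TemporaryDirectory() as tmp:
        for key in keys:
            try:
                naive_download(Path(tmp), key, b"data")
            except OSError as exc:  # FileExistsError, NotADirectoryError, ...
                return key, type(exc).__name__
    return None


# "foo" lands first as a file, so "foo/documents" cannot create its parent dir.
file_first = try_order(["foo", "foo/documents"])
# "foo" lands first as a directory, so writing the file "foo" fails instead.
dir_first = try_order(["foo/documents", "foo"])
```

In both orders the first object succeeds and the second fails, so which object is lost depends only on which download happens to run first.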
Solution
Used tempfile to create unique download paths for each S3 object:
Before:
After:
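As a rough illustration of the idea (not the PR's actual diff), each object can be written into its own `tempfile.mkdtemp` directory so that no two keys ever share a parent path. The function name and the choice to keep only the key's basename are assumptions for this sketch:

```python
import tempfile
from pathlib import Path


def download_to_unique_dir(root: Path, key: str, body: bytes) -> Path:
    """Write one object into a fresh temp directory under root."""
    root.mkdir(parents=True, exist_ok=True)
    # mkdtemp creates a new, uniquely named directory on every call.
    temp_dir = Path(tempfile.mkdtemp(prefix="unstructured_", dir=root))
    # Assumption: only the basename of the key is kept locally.
    target = temp_dir / Path(key).name
    target.write_bytes(body)
    return target


with tempfile.TemporaryDirectory() as tmp:
    # The conflicting pair from the example above now downloads cleanly.
    paths = [
        download_to_unique_dir(Path(tmp), key, b"data")
        for key in ["foo", "foo/documents"]
    ]
    all_written = all(p.is_file() for p in paths)
    distinct_parents = paths[0].parent != paths[1].parent
```

Because `foo` and `foo/documents` end up in different directories, neither needs `foo` to be both a file and a directory.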
Future Work
This PR targets only the s3 downloads. I think it would make sense to use tempfiles for all downloads (as in PR #571), but that requires more extensive changes to implement cleanly. This fix provides immediate relief from the path conflict issues while we work on the more comprehensive tempfile solution.